Abstract

Techniques involving factorization are found in a wide range of applications and have enjoyed 
significant empirical success in many fields. However, common to a vast majority of these problems 
is the significant disadvantage that the associated optimization problems are typically non-convex 
due to a multilinear form or other convexity destroying transformation. Here we build on ideas from 
convex relaxations of matrix factorizations and present a very general framework which allows for 
the analysis of a wide range of non-convex factorization problems - including matrix factorization, 
tensor factorization, and deep neural network training formulations. We derive sufficient conditions 
to guarantee that a local minimum of the non-convex optimization problem is a global minimum 
and show that if the size of the factorized variables is large enough then from any initialization 
it is possible to find a global minimizer using a purely local descent algorithm. Our framework 
also provides a partial theoretical justification for the increasingly common use of Rectified Linear 
Units (ReLUs) in deep neural networks and offers guidance on deep network architectures and 
regularization strategies to facilitate efficient optimization. 
1. Introduction


Models involving factorization or decomposition are ubiquitous across a wide variety of technical 
fields and application areas. As a simple example relevant to machine learning, various forms of 
matrix factorization are used in classical dimensionality reduction techniques such as Principal
Component Analysis (PCA) and in more recent methods like non-negative matrix factorization or
dictionary learning (Lee and Seung, 1999; Aharon et al., 2006; Mairal et al., 2010). In a typical
matrix factorization problem, we might seek to find matrices $(U, V)$ such that the product $UV^T$
closely approximates a given data matrix $Y$ while at the same time requiring that $U$ and $V$ satisfy
certain properties (e.g., non-negativity, sparseness, etc.). This naturally leads to an optimization
problem of the form

$$\min_{U,V} \; \ell(Y, UV^T) + \Theta(U, V) \tag{1}$$


where $\ell$ is some function that measures how closely $Y$ is approximated by $UV^T$ and $\Theta$ is a
regularization function to enforce the desired properties in $U$ and $V$. Unfortunately, aside from a few
special cases (e.g., PCA), a vast majority of matrix factorization models suffer from the significant
disadvantage that the associated optimization problems are non-convex and very challenging
to solve. For example, in (1) even if we choose $\Theta(U, V)$ to be jointly convex in $(U, V)$ and $\ell(Y, X)$
to be a convex function in $X$, the optimization problem is still typically a non-convex problem in
$(U, V)$ due to the composition with the bilinear form $X = UV^T$.
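To make the loss of convexity concrete, here is a minimal sketch in the scalar case, assuming a squared loss and a target value of 1 (both chosen only for illustration). The loss is convex in the product $x = uv$, yet the midpoint of two exact fits has a strictly larger objective than either endpoint.

```python
def loss(u: float, v: float, y: float = 1.0) -> float:
    """Convex loss (y - x)^2 composed with the bilinear form x = u * v."""
    return (y - u * v) ** 2


# Two points that both fit y = 1 exactly (u * v = 1).
point_a = (1.0, 1.0)
point_b = (-1.0, -1.0)
midpoint = tuple((a + b) / 2 for a, b in zip(point_a, point_b))  # (0.0, 0.0)

f_a, f_b, f_mid = loss(*point_a), loss(*point_b), loss(*midpoint)

# A convex function would satisfy f_mid <= (f_a + f_b) / 2.
violates_convexity = f_mid > (f_a + f_b) / 2
print(f_a, f_b, f_mid, violates_convexity)
```

The same sign-flip symmetry, $(U, V) \to (-U, -V)$, is present in the matrix case, so the scalar example is representative rather than special.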

Given this challenge, a common approach is to relax the non-convex factorization problem into 
a problem which is convex on the product of the factorized matrices, $X = UV^T$. As a concrete
example, in low-rank matrix factorization, one might be interested in solving a problem of the form

$$\min_{U,V} \; \ell(Y, UV^T) \quad \text{subject to} \quad \mathrm{rank}(UV^T) \leq r \tag{2}$$


where the rank constraint can be easily enforced by limiting the number of columns in the $U$ and $V$
matrices to be less than or equal to $r$. However, aside from a few special choices of $\ell$, solving (2)
is an NP-hard problem in general. Instead, one can relax (2) into a fully convex problem by using a
convex regularization that promotes low-rank solutions, such as the nuclear norm $\|X\|_*$, and then
solve


$$\min_{X} \; \ell(Y, X) + \lambda \|X\|_* \tag{3}$$


which can be done efficiently if $\ell(Y, X)$ is convex with respect to $X$ (Cai et al., 2008; Recht et al.,
2010). Given a solution to (3), $X_{opt}$, it is then simple to find a low-rank factorization $UV^T = X_{opt}$
via a singular value decomposition. Unfortunately, however, while the nuclear norm provides a nice
convex relaxation for low-rank matrix factorization problems, nuclear norm relaxation does not
capture the full generality of problems such as (1) as it does not necessarily ensure that $X_{opt}$ can
be 'efficiently' factorized as $X_{opt} = UV^T$ for some $(U, V)$ pair which has the desired properties
encouraged by $\Theta(U, V)$ (sparseness, non-negativity, etc.), nor does it provide a means to find the
desired factors. To address these issues, in this paper we consider the task of solving non-convex
optimization problems directly in the factorized space and use ideas inspired from the convex
relaxation of matrix factorizations as a means to analyze the non-convex factorization problem. Our
framework includes problems such as (1) as a special case but also applies much more broadly to a
wide range of non-convex optimization problems, several of which we describe below.


1.1 Generalized Factorization 


More generally, tensor factorization models provide a natural extension to matrix factorization
and have been employed in a wide variety of applications (Cichocki et al., 2009; Kolda and Bader,
2009). The resulting optimization problem is similar to matrix factorization, with the difference that
we now consider more general factorizations which decompose a multidimensional tensor $Y$ into
a set of $K$ different factors $(X^1, \ldots, X^K)$, where each factor is also possibly a multidimensional
tensor. These factors are then combined via an arbitrary multilinear mapping $\Phi(X^1, \ldots, X^K) \approx Y$;
i.e., $\Phi$ is a linear function in each $X^i$ term if the other $X^j$, $i \neq j$ terms are held constant. This model
then typically gives optimization problems of the form


$$\min_{X^1, \ldots, X^K} \; \ell(Y, \Phi(X^1, \ldots, X^K)) + \Theta(X^1, \ldots, X^K) \tag{4}$$


where again $\ell$ might measure how closely $Y$ is approximated by the tensor $\Phi(X^1, \ldots, X^K)$ and $\Theta$
encourages the factors $(X^1, \ldots, X^K)$ to satisfy certain requirements. Clearly, (4) is a generalization
of (1) by taking $(X^1, X^2) = (U, V)$ and $\Phi(U, V) = UV^T$, and similar to matrix factorization, the
optimization problem given by (4) will typically be non-convex regardless of the choice of $\Theta$ and $\ell$
functions due to the multilinear mapping $\Phi$.

While the tensor factorization framework is very general with regard to the dimensionalities of
the data and the factors, a tensor factorization usually implies the assumption that the mapping $\Phi$
from the factorized space to the output space (the codomain of $\Phi$) is multilinear. However, if we
consider more general mappings from the factorized space into the output space (i.e., mappings
which are not restricted to be multilinear) then we can capture a much broader array of models
in the 'factorized model' family. For example, in deep neural network training the output of the
network is typically generated by performing an alternating series of a linear function followed by
a non-linear function. More concretely, if one is given training data consisting of $N$ data points of $d$
dimensional data, $V \in \mathbb{R}^{N \times d}$, and an associated vector of desired outputs $Y \in \mathbb{R}^N$, the goal then is
to find a set of network parameters $(X^1, \ldots, X^K)$ by solving an optimization problem of the form
(4) using a mapping

$$\Phi(X^1, \ldots, X^K) = \psi_K(\psi_{K-1}(\cdots \psi_2(\psi_1(V X^1) X^2) \cdots X^{K-1}) X^K) \tag{5}$$

where each $X^i$ factor is an appropriately sized matrix and the $\psi_i$ functions apply some form of
non-linearity after each matrix multiplication, e.g., a sigmoid function, rectification, max-pooling.
Note that although here we have shown the linear operations to be simple matrix multiplications
for notational simplicity, this is easily generalized to other linear operators (e.g., in a convolutional
network each linear operator could be a set of convolutions with a group of various kernels with
parameters contained in the $X^i$ variables).


1.2 Paper Contributions 


Our primary contribution is to extend ideas from convex matrix factorization and present a general 
framework which allows for a wide variety of factorization problems to be analyzed within a convex 
formulation. Specifically, using this convex framework we are able to show that local minima of the
non-convex factorization problem achieve the global minimum if they satisfy a simple condition.
Further, we also show that if the factorization is done with factorized variables of sufficient size,
then from any initialization it is always possible to reach a global minimizer using purely local
descent search strategies.

Two concepts are key to our analysis framework: 1) the size of the factorized elements is not
constrained, but instead fit to the data through regularization (for example, the number of columns
in $U$ and $V$ is allowed to change in matrix factorization) 2) we require that the mapping from the
factorized elements to the final output, $\Phi$, satisfies a positive homogeneity property. Interestingly,
the deep learning field has increasingly moved to using non-linearities such as Rectified Linear
Units (ReLU) and Max-Pooling, both of which satisfy the positive homogeneity property, and it has
been noted empirically that both the speed of training the neural network and the overall performance
of the network are increased significantly when ReLU non-linearities are used instead of the
more traditional hyperbolic tangent or sigmoid non-linearities (Dahl et al., 2013; Maas et al., 2013;
Krizhevsky et al., 2012; Zeiler et al., 2013). We suggest that our framework provides a partial
theoretical explanation for this phenomenon and also offers directions of future research which might be
beneficial in improving the performance of multilayer neural networks.


2. Prior Work 

Despite the significant empirical success and wide ranging applications of the models discussed
above (and many others not discussed), as we have mentioned, a vast majority of the above
techniques suffer from the significant disadvantage that the associated optimization problems
are non-convex and very challenging to solve. As a result, the numerical optimization algorithms
often used to solve factorization problems - including (but certainly not limited to) alternating
minimization, gradient descent, stochastic gradient descent, block coordinate descent, back-propagation,
and quasi-Newton methods - are typically only guaranteed to converge to a critical point or local
minimum of the objective function (Mairal et al., 2010; Rumelhart et al., 1988; Ngiam et al., 2011;
Wright and Nocedal, 1999; Xu and Yin, 2013). The nuclear norm relaxation of low-rank matrix
factorization discussed above provides a means to solve factorization problems with regularization
promoting low-rank solutions^1, but it fails to capture the full generality of problems such as (1)
as it does not allow one to find factors, $(U, V)$, with the desired properties encouraged by $\Theta(U, V)$
(sparseness, non-negativity, etc.). To address this issue, several studies have explored a more general
convex relaxation via the matrix norm given by


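$$\|X\|_{u,v} \equiv \inf_{r \in \mathbb{N}_+} \; \inf_{U,V : UV^T = X} \; \sum_{i=1}^r \|U_i\|_u \|V_i\|_v = \inf_{r \in \mathbb{N}_+} \; \inf_{U,V : UV^T = X} \; \sum_{i=1}^r \tfrac{1}{2} \left( \|U_i\|_u^2 + \|V_i\|_v^2 \right) \tag{6}$$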


where $(U_i, V_i)$ denotes the i'th columns of $U$ and $V$, $\|\cdot\|_u$ and $\|\cdot\|_v$ are arbitrary vector norms,
and the number of columns ($r$) in the $U$ and $V$ matrices is allowed to be variable (Bach et al.,
2008; Bach, 2013; Haeffele et al., 2014). The norm in (6) has appeared under multiple names in
the literature, including the projective tensor norm, decomposition norm, and atomic norm, and
by replacing the column norms in (6) with gauge functions the formulation can be generalized to
incorporate additional regularization on $(U, V)$, such as non-negativity, while still being a convex
function of $X$ (Bach, 2013). Further, it is worth noting that for particular choices of the $\|\cdot\|_u$
and $\|\cdot\|_v$ vector norms, $\|X\|_{u,v}$ reverts to several well known matrix norms and thus provides a
generalization of many commonly used regularizers. Notably, when the vector norms are both $l_2$
norms, $\|X\|_{2,2} = \|X\|_*$, and the form in (6) is the well known variational definition of the nuclear
norm.
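The two forms of the sum in (6) agree at the infimum because a column pair can be rescaled without changing its outer product. The sketch below, which assumes $l_2$ norms and uses the hypothetical helper names `product_form`, `squares_form`, and `balance`, checks this for a single column pair: the half-sum of squares upper-bounds the product of norms, and the gap closes once the two norms are balanced.

```python
import math


def norm2(x):
    """Euclidean (l2) norm of a vector given as a list of floats."""
    return math.sqrt(sum(t * t for t in x))


def product_form(u, v):
    return norm2(u) * norm2(v)


def squares_form(u, v):
    return 0.5 * (norm2(u) ** 2 + norm2(v) ** 2)


def balance(u, v):
    """Rescale (u, v) to (c*u, v/c) with c chosen so both norms match.

    The outer product u v^T is unchanged by this rescaling.
    Assumes u and v are both non-zero.
    """
    c = math.sqrt(norm2(v) / norm2(u))
    return [c * t for t in u], [t / c for t in v]


u, v = [3.0, 4.0], [1.0, 0.0]  # norms 5 and 1
u_bal, v_bal = balance(u, v)

unbalanced_gap = squares_form(u, v) - product_form(u, v)
balanced_gap = squares_form(u_bal, v_bal) - product_form(u_bal, v_bal)
print(unbalanced_gap, balanced_gap)
```

Since the infimum in (6) ranges over all factorizations of $X$, it can always pick the balanced scaling, which is why either form may be used.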

The $\|\cdot\|_{u,v}$ norm has the appealing property that by an appropriate choice of vector norms $\|\cdot\|_u$
and $\|\cdot\|_v$ (or more generally gauge functions), one can promote desired properties in the factorized
matrices $(U, V)$ while still working with a problem which is convex w.r.t. the product $X = UV^T$.
Based on this concept, several studies have explored optimization problems over factorized matrices
$(U, V)$ of the form

$$\min_{U,V} \; \ell(UV^T) + \lambda \|UV^T\|_{u,v} \tag{7}$$


Even though the problem is still non-convex w.r.t. the factorized matrices $(U, V)$, it can be shown
using ideas from Burer and Monteiro (2005) on factorized semidefinite programming that, subject
to a few general conditions, local minima of (7) will be global minima (Bach et al., 2008;
Haeffele et al., 2014), which can significantly reduce the dimensionality of some large scale
optimization problems. Unfortunately, aside from a few special cases, the norm defined by (6) (and
related regularization functions such as those discussed by Bach (2013)) cannot be evaluated
efficiently, much less optimized over, due to the complicated and non-convex nature of the definition.
As a result, in practice one is often forced to replace (7) by the closely related problem


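$$\min_{U,V} \; \ell(UV^T) + \lambda \sum_{i=1}^r \|U_i\|_u \|V_i\|_v = \min_{U,V} \; \ell(UV^T) + \lambda \sum_{i=1}^r \tfrac{1}{2} \left( \|U_i\|_u^2 + \|V_i\|_v^2 \right) \tag{8}$$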


1. Similar convex relaxation techniques have also been proposed for low-rank tensor factorizations, but in the case
of tensors finding a final factorization $X_{opt} = \Phi(X^1, \ldots, X^K)$ from a low-rank tensor can still be a challenging
problem (Tomioka et al., 2010; Gandy et al., 2011).
However, (7) and (8) are not equivalent problems, due to the fact that solutions to (7) include
any factorization $(U, V)$ such that their product equals the optimal solution, $UV^T = X_{opt}$, while
in (8) one is specifically searching for a factorization $(U, V)$ that achieves the infimum in (6); in
brief, solutions to (8) will be solutions to (7), but the converse is not true. As a consequence,
results guaranteeing that local minima of the form (7) will be global minima cannot be applied to
the formulation in (8), which is typically more useful in practice. Here we focus our analysis on
the more commonly used family of problems, such as (8), and show that similar guarantees can be
provided regarding the global optimality of local minima. Additionally, we show that these ideas can
be significantly extended to a very wide range of non-convex models and regularization functions,
with applications such as tensor factorization and certain forms of neural network training being
additional special cases of our framework.

In the context of neural networks, Bengio et al. (2005) showed that for neural networks with
a single hidden layer, if the number of neurons in the hidden layer is not fixed, but instead fit to
the data through a sparsity inducing regularization, then the process of training a globally optimal
neural network is analogous to selecting a finite number of hidden units from the infinite dimensional
space of all possible hidden units and taking a weighted summation of these units to produce the
output. Further, these ideas have very recently been used to analyze the generalization performance
of such networks (Bach, 2014). Here, our results take a similar approach and extend these ideas
to certain forms of multi-layer neural networks. Additionally, our framework provides sufficient
conditions on the network architecture to guarantee that from any initialization a globally optimal
solution can be found by performing purely local descent on the network weights.


3. Preliminaries 

Before we present our main results, we first describe our notation system and recall a few definitions. 

3.1 Notation 

Our formulation is fairly general with regard to the dimensionality of the data and factorized variables.
As a result, to simplify the notation, we will use capital letters as a shorthand for a set of dimensions,
and individual dimensions will be denoted with lower case letters. For example, $X \in \mathbb{R}^{d_1 \times \ldots \times d_N} \equiv
X \in \mathbb{R}^D$ for $D = d_1 \times \ldots \times d_N$; we also denote the cardinality of $D$ as $\mathrm{card}(D) = \prod_{i=1}^N d_i$.
Similarly, $X \in \mathbb{R}^{D \times R} \equiv X \in \mathbb{R}^{d_1 \times \ldots \times d_N \times r_1 \times \ldots \times r_M}$ for $D = d_1 \times \ldots \times d_N$ and $R = r_1 \times \ldots \times r_M$.

Given an element from a tensor space, we will use a subscript to denote a slice of the tensor
along the last dimension. For example, given a matrix $X \in \mathbb{R}^{d \times r}$, then $X_i \in \mathbb{R}^d$, $i \in \{1, \ldots, r\}$,
denotes the i'th column of $X$. Similarly, given a cube $X \in \mathbb{R}^{d_1 \times d_2 \times r}$, then $X_i \in \mathbb{R}^{d_1 \times d_2}$, $i \in
\{1, \ldots, r\}$, denotes the i'th slice along the third dimension. Further, given two tensors with matching
dimensions except for the last dimension, $X \in \mathbb{R}^{D \times r_x}$ and $Y \in \mathbb{R}^{D \times r_y}$, we will use $[X \ Y] \in
\mathbb{R}^{D \times (r_x + r_y)}$ to denote the concatenation of the two tensors along the last dimension.

We denote the dot product between two elements from a tensor space ($x \in \mathbb{R}^D$, $y \in \mathbb{R}^D$) as
$\langle x, y \rangle = \mathrm{vec}(x)^T \mathrm{vec}(y)$, where $\mathrm{vec}(\cdot)$ denotes flattening the tensor into a vector. For a function
$\theta(x)$, we denote its image as $\mathrm{Im}(\theta)$ and its Fenchel dual as $\theta^*(x) \equiv \sup_z \langle x, z \rangle - \theta(z)$. The gradient
of a differentiable function $\theta(x)$ is denoted $\nabla \theta(x)$, and the subgradient of a convex (but possibly
non-differentiable) function $\theta(x)$ is denoted $\partial \theta(x)$. For a differentiable function with multiple
variables $\theta(x^1, \ldots, x^K)$, we will use $\nabla_{x^i} \theta(x^1, \ldots, x^K)$ to denote the portion of the gradient
corresponding to $x^i$. The space of non-negative real numbers is denoted $\mathbb{R}_+$, and the space of positive
integers is denoted $\mathbb{N}_+$.


3.2 Definitions 

We now make/recall a few general definitions and well known facts which will be used in our 
analysis. 

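Definition 1 A size-r set of $K$ factors $(X^1, \ldots, X^K)_r$ is defined to be a set of $K$ tensors where the
final dimension of each tensor is equal to $r$. This is to be interpreted $(X^1, \ldots, X^K)_r \in \mathbb{R}^{(D^1 \times r)} \times
\ldots \times \mathbb{R}^{(D^K \times r)}$.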


Definition 2 The indicator function of a set $C$ is defined as

$$\delta_C(x) = \begin{cases} 0 & x \in C \\ \infty & x \notin C \end{cases} \tag{9}$$


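Definition 3 A function $\theta : \mathbb{R}^{D^1} \times \ldots \times \mathbb{R}^{D^N} \to \mathbb{R}^D$ is positively homogeneous with degree $p$ if

$\theta(\alpha x^1, \ldots, \alpha x^N) = \alpha^p \theta(x^1, \ldots, x^N)$, $\forall \alpha \geq 0$.

Note that this definition also implies that $\theta(0, \ldots, 0) = 0$ for $p \neq 0$.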
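As a quick numerical illustration of Definition 3, the following sketch checks the scaling identity for two mappings used later in the paper, the outer product (degree 2) and rectification (degree 1), and shows that the hyperbolic tangent fails the degree-1 identity. The test vectors and the value $\alpha = 2$ are arbitrary choices made for this example.

```python
import math


def outer(u, v):
    """Outer product u v^T as a nested list."""
    return [[ui * vj for vj in v] for ui in u]


def relu(x):
    """Element-wise rectification max(x, 0)."""
    return [max(t, 0.0) for t in x]


def scaled(x, alpha):
    return [alpha * t for t in x]


alpha = 2.0
u, v = [1.0, -2.0], [3.0, 0.5]

# Outer product: degree 2, since both arguments are scaled by alpha.
lhs_outer = outer(scaled(u, alpha), scaled(v, alpha))
rhs_outer = [[alpha ** 2 * t for t in row] for row in outer(u, v)]

# ReLU: degree 1.
lhs_relu = relu(scaled(u, alpha))
rhs_relu = scaled(relu(u), alpha)

# tanh does not satisfy the degree-1 identity: tanh(2 * 1) != 2 * tanh(1).
tanh_gap = abs(math.tanh(alpha * 1.0) - alpha * math.tanh(1.0))

print(lhs_outer == rhs_outer, lhs_relu == rhs_relu, tanh_gap > 0.1)
```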

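Definition 4 A function $\theta : \mathbb{R}^{D^1} \times \ldots \times \mathbb{R}^{D^N} \to \mathbb{R}_+$ is positive semidefinite if $\theta(0, \ldots, 0) = 0$
and $\theta(x^1, \ldots, x^N) \geq 0$, $\forall (x^1, \ldots, x^N)$.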


Definition 5 The one-sided directional derivative of a function $\theta(x)$ at a point $x$ in the direction $z$
is denoted $d\theta(x)(z)$ and defined as $d\theta(x)(z) \equiv \lim_{\epsilon \searrow 0} (\theta(x + \epsilon z) - \theta(x)) \epsilon^{-1}$.

Also, recall that for a differentiable function $\theta(x)$, $d\theta(x)(z) = \langle \nabla \theta(x), z \rangle$.
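A small scalar sketch of Definition 5 follows, using a forward difference with a fixed step `eps` as a stand-in for the limit (an approximation assumed only for this illustration). At the kink of the rectifier the two one-sided derivatives are 1 and 0, so no single gradient reproduces both, whereas for a smooth function the estimate matches $\langle \nabla \theta(x), z \rangle$.

```python
def directional_derivative(theta, x, z, eps=1e-6):
    """Forward-difference estimate of d theta(x)(z) for scalar x and z.

    Approximates lim_{eps -> 0+} (theta(x + eps * z) - theta(x)) / eps
    with a small fixed eps (an assumption of this sketch).
    """
    return (theta(x + eps * z) - theta(x)) / eps


def relu(x):
    return max(x, 0.0)


def square(x):
    return x * x


# ReLU at its kink: the two one-sided derivatives are not negatives of each other.
d_plus = directional_derivative(relu, 0.0, 1.0)
d_minus = directional_derivative(relu, 0.0, -1.0)

# Differentiable case: the estimate should be close to gradient * z = (2 * 3) * 2 = 12.
d_square = directional_derivative(square, 3.0, 2.0)
print(d_plus, d_minus, d_square)
```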


4. Problem Formulation 

Returning to the motivating example from the introduction (4), we now define the family of mapping
functions from the factors into the output space and the family of regularization functions on the
factors ($\Phi$ and $\Theta$, respectively) which we will study in our framework.

4.1 Factorization Mappings 

In this paper, we consider mappings $\Phi$ which are based on a sum of what we refer to as an elemental
mapping. Specifically, if we are given a size-r set of $K$ factors $(X^1, \ldots, X^K)_r$, the elemental
mapping $\phi : \mathbb{R}^{D^1} \times \ldots \times \mathbb{R}^{D^K} \to \mathbb{R}^D$ takes a slice along the last dimension from each tensor in
the set of factors and maps it into the output space. We then define the full mapping to be the sum
of these elemental mappings along each of the $r$ slices in the set of factors. The only requirement
we impose on the elemental mapping is that it must be positively homogeneous. More formally,

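Definition 6 An elemental mapping, $\phi : \mathbb{R}^{D^1} \times \ldots \times \mathbb{R}^{D^K} \to \mathbb{R}^D$ is any mapping which is
positively homogeneous with degree $p \neq 0$. The r-element factorization mapping $\Phi_r : \mathbb{R}^{(D^1 \times r)} \times
\ldots \times \mathbb{R}^{(D^K \times r)} \to \mathbb{R}^D$ is defined as

$$\Phi_r(X^1, \ldots, X^K) = \sum_{i=1}^r \phi(X_i^1, \ldots, X_i^K). \tag{10}$$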
As we do not place any restrictions on the elemental mapping, $\phi$, beyond the requirement that it
must be positively homogeneous, there are a wide range of problems that can be captured by a
mapping with form (10). Several example problems which can be placed in this framework include:

Matrix Factorization: The elemental mapping, $\phi : \mathbb{R}^{d_1} \times \mathbb{R}^{d_2} \to \mathbb{R}^{d_1 \times d_2}$

$$\phi(u, v) = uv^T \tag{11}$$

is positively homogeneous with degree 2 and $\Phi_r(U, V) = \sum_{i=1}^r U_i V_i^T = UV^T$ is simply matrix
multiplication for matrices with $r$ columns.
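The identity $\Phi_r(U, V) = UV^T$ can be checked directly by summing the elemental mapping (11) over columns, as in (10). The following is a minimal sketch using nested lists; the function names and the small test matrices are illustrative assumptions.

```python
def phi(u, v):
    """Elemental mapping (11): the outer product u v^T."""
    return [[ui * vj for vj in v] for ui in u]


def column(X, i):
    """The i'th slice along the last dimension of a matrix, i.e. its i'th column."""
    return [row[i] for row in X]


def factorization_mapping(U, V):
    """Phi_r(U, V) = sum_i phi(U_i, V_i), following (10)."""
    r = len(U[0])
    out = [[0.0] * len(V) for _ in U]
    for i in range(r):
        term = phi(column(U, i), column(V, i))
        for a, term_row in enumerate(term):
            for b, value in enumerate(term_row):
                out[a][b] += value
    return out


def matmul_transpose(U, V):
    """Direct computation of U V^T for comparison."""
    return [[sum(a * b for a, b in zip(u_row, v_row)) for v_row in V] for u_row in U]


U = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]  # 3 x 2, so r = 2
V = [[2.0, 0.0], [1.0, 1.0]]               # 2 x 2

summed = factorization_mapping(U, V)
direct = matmul_transpose(U, V)
print(summed, direct)
```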

Tensor Decomposition - CANDECOMP/PARAFAC (CP): Slightly more generally, the elemental
mapping $\phi : \mathbb{R}^{d_1} \times \ldots \times \mathbb{R}^{d_K} \to \mathbb{R}^{d_1 \times \ldots \times d_K}$

$$\phi(x^1, \ldots, x^K) = x^1 \otimes \cdots \otimes x^K \tag{12}$$

(where $\otimes$ denotes the tensor outer product) results in $\Phi_r(X^1, \ldots, X^K)$ being the mapping used in
the rank-r CANDECOMP/PARAFAC (CP) tensor decomposition model (Kolda and Bader, 2009).
Further, instead of choosing $\phi$ to be a simple outer product, we can also generalize this to be any
multilinear function of the factors $(X_i^1, \ldots, X_i^K)$.^2

Neural Networks with Rectified Linear Units (ReLU): Let $\psi^+(x) \equiv \max\{x, 0\}$ be the linear
rectification function, which is applied element-wise to a tensor $x$ of arbitrary dimension. Then if
we are given a matrix of training data $V \in \mathbb{R}^{N \times d_1}$, the elemental mapping $\phi : \mathbb{R}^{d_1} \times \mathbb{R}^{d_2} \to \mathbb{R}^{N \times d_2}$

$$\phi(x^1, x^2) = \psi^+(V x^1)(x^2)^T \tag{13}$$

results in a mapping $\Phi_r(X^1, X^2) = \psi^+(V X^1)(X^2)^T$, which can be interpreted as producing the $d_2$
outputs of a 3 layer neural network with $r$ hidden units in response to the input of $N$ data points of
$d_1$ dimensional data, $V$. The hidden units have a ReLU non-linearity; the other units are linear; and
the $(X^1, X^2) \in \mathbb{R}^{d_1 \times r} \times \mathbb{R}^{d_2 \times r}$ matrices contain the connection weights from the input-to-hidden
and hidden-to-output layers, respectively.
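The single-hidden-layer mapping (13) can be written out as a sum over hidden units. The sketch below does this with nested lists and checks the degree-2 positive homogeneity numerically; the data matrix, the weights, and the scale factor are made-up values for illustration only.

```python
def relu(t):
    return max(t, 0.0)


def phi(V, x1, x2):
    """Elemental mapping (13): relu(V x1) (x2)^T, an N x d2 nested list."""
    hidden = [relu(sum(a * b for a, b in zip(row, x1))) for row in V]
    return [[h * o for o in x2] for h in hidden]


def network(V, X1, X2):
    """Phi_r(X1, X2): sum of phi over the r columns (one column per hidden unit)."""
    r = len(X1[0])
    out = [[0.0] * len(X2) for _ in V]
    for i in range(r):
        term = phi(V, [row[i] for row in X1], [row[i] for row in X2])
        for n, term_row in enumerate(term):
            for k, value in enumerate(term_row):
                out[n][k] += value
    return out


V = [[1.0, -1.0], [0.5, 2.0]]    # N = 2 data points, d1 = 2 features
X1 = [[1.0, -1.0], [1.0, 1.0]]   # d1 x r input-to-hidden weights, r = 2
X2 = [[2.0, 1.0]]                # d2 x r hidden-to-output weights, d2 = 1

output = network(V, X1, X2)

# Scaling both weight matrices by alpha scales the output by alpha^2.
alpha = 2.0
scaled_output = network(
    V,
    [[alpha * t for t in row] for row in X1],
    [[alpha * t for t in row] for row in X2],
)
print(output, scaled_output)
```

Each hidden unit contributes one term of the sum, so adding a hidden unit corresponds to appending one column to each of $X^1$ and $X^2$.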

By utilizing more complicated definitions of $\phi$, it is possible to consider a broad range of neural
network architectures. As a simple example of networks with multiple hidden layers, an elemental
mapping such as $\phi : \mathbb{R}^{d_1 \times d_2} \times \mathbb{R}^{d_2 \times d_3} \times \mathbb{R}^{d_3 \times d_4} \times \mathbb{R}^{d_4 \times d_5} \to \mathbb{R}^{N \times d_5}$

$$\phi(x^1, x^2, x^3, x^4) = \psi^+(\psi^+(\psi^+(V x^1) x^2) x^3) x^4 \tag{14}$$

gives a $\Phi_r(X^1, X^2, X^3, X^4)$ mapping which is the output of a 5 layer neural network in response
to the inputs in the $V \in \mathbb{R}^{N \times d_1}$ matrix with ReLU non-linearities on all of the hidden layer units.
In this case, the network has the architecture that there are $r$, 4 layer fully-connected subnetworks,
with each subnetwork having the same number of units in each layer as defined by the dimensions
$\{d_2, d_3, d_4\}$. The $r$ subnetworks are all then fed into a fully connected linear layer to produce the
output.

More general still, since any positively homogeneous transformation is a potential elemental
mapping, by an appropriate definition of $\phi$, one can describe neural networks with very general
architectures, provided the non-linearities in the network are compatible with positive homogeneity.
Note that max-pooling and rectification are both positively homogeneous and thus fall within
our framework. For example, the well-known ImageNet network from (Krizhevsky et al., 2012),
which consists of a series of convolutional layers, linear-rectification, max-pooling layers, response
normalization layers, and fully connected layers, can be described by taking $r = 1$ and defining
$\phi$ to be the entire transformation of the network (with the removal of the response normalization
layers, which are not positively homogeneous). Note, however, that our results will rely on $r$ potentially
changing size or being initialized to be sufficiently large, which limits the applicability of our
results to current state-of-the-art network architectures (see discussion).

2. We note that more general tensor decompositions, such as the general form of the Tucker decomposition, do not
explicitly fit inside the framework we describe here; however, by using similar arguments to the ones we develop
here, it is possible to show analogous results to those we derive in this paper for more general tensor decompositions,
which we do not show for clarity of presentation.

Here we have provided a few examples of common factorization mappings that can be cast
in form (10), but certainly there are a wide variety of other problems for which our framework is
relevant. Additionally, while all of the mappings described above are positively homogeneous with
degree equal to the degree of the factorization ($K$), this is not a requirement; $p \neq 0$ is sufficient. For
example, non-linearities such as a rectification followed by raising each element to a non-zero power
are positively homogeneous but of a possibly different degree. What will turn out to be essential,
however, is that we require $p$ to match the degree of positive homogeneity used to regularize the
factors, which we will discuss in the next section.


4.2 Factorization Regularization 

Inspired by the ideas from structured convex matrix factorization, instead of trying to analyze the
optimization over a size-r set of $K$ factors $(X^1, \ldots, X^K)_r$ for a fixed $r$, we instead consider the
optimization problem where $r$ is possibly allowed to vary and is adapted to the data through
regularization. To do so, we will define a regularization function similar to the $\|\cdot\|_{u,v}$ norm discussed
in matrix factorization which is convex with respect to the output tensor but which still allows for
regularization to be placed on the factors. Similar to our definition in (10), we will begin by first
defining an elemental regularization function $g : \mathbb{R}^{D^1} \times \ldots \times \mathbb{R}^{D^K} \to \mathbb{R}_+ \cup \infty$ which takes as
input slices of the factorized tensors along the last dimension and returns a non-negative number. The
requirements we place on $g$ are that it must be positively homogeneous and positive semidefinite.
Formally,

Definition 7 We define an elemental regularization function $g : \mathbb{R}^{D^1} \times \ldots \times \mathbb{R}^{D^K} \to \mathbb{R}_+ \cup \infty$,
to be any function which is positive semidefinite and positively homogeneous.

Again, due to the generality of the framework, there are a wide variety of possible elemental 
regularization functions. We highlight two positive semidefinite, positively homogeneous functions 
which are commonly used and note that functions can be composed with summations, multiplica¬ 
tions, and raising to non-zero powers to change the degree of positive homogeneity and combine 
various functions. 

Norms: Any norm $\|x\|$ is positively homogeneous with degree 1. Note that because we make
no requirement of convexity on $g$, this framework can also include functions such as the $l_q$
pseudo-norms for $q \in (0, 1)$.

Conic Indicators: The indicator function $\delta_C(x)$ of any conic set $C$ is positively homogeneous
for all degrees. Recall that a conic set, $C$, is simply any set such that if $x \in C$ then $\alpha x \in C$, $\forall \alpha \geq 0$.
A few popular conic sets which can be of interest include the non-negative orthant $\mathbb{R}^D_+$, the kernel
of a linear operator $\{x : Ax = 0\}$, inequality constraints for a linear operator $\{x : Ax \geq 0\}$,
and the set of positive semidefinite matrices. Constraints on the non-zero support of $x$ are also
typically conic sets. For example, the set $\{x : \|x\|_0 \leq n\}$ is a conic set, where $\|x\|_0$ is simply the
number of non-zero elements in $x$ and $n$ is a positive integer. More abstractly, conic sets can also
be used to enforce invariances w.r.t. positively homogeneous transformations. For example, given
two positively homogeneous functions $\theta(x)$, $\theta'(x)$ with equal degrees of positive homogeneity, the
sets $\{x : \theta(x) = \theta'(x)\}$ and $\{x : \theta(x) \geq \theta'(x)\}$ are also conic sets.

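A few typical formulations of $g$ which are positively homogeneous with degree $K$ might
include:

$$g(x^1, \ldots, x^K) = \prod_{i=1}^K \|x^i\|_{(i)} \tag{15}$$

$$g(x^1, \ldots, x^K) = \frac{1}{K} \sum_{i=1}^K \|x^i\|_{(i)}^K \tag{16}$$

$$g(x^1, \ldots, x^K) = \prod_{i=1}^K \left( \|x^i\|_{(i)} + \delta_{C_i}(x^i) \right) \tag{17}$$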

where all of the norms, $\|\cdot\|_{(i)}$, are arbitrary. Forms (15) and (16) can be shown to be equivalent, in
the sense that they give rise to the same $\Omega_{\phi,g}$ function, for all of the example mappings $\phi$ we have
discussed here and by an appropriate choice of norm can induce various properties in the factorized
elements (such as sparsity), while form (17) is similar but additionally constrains each factor to be
an element of a conic set $C_i$ (see Bach et al., 2008; Bach, 2013; Haeffele et al., 2014, for examples
from matrix factorization).
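The relationship between forms (15) and (16) can be explored numerically. The sketch below assumes every norm is the $l_2$ norm and uses $K = 3$ made-up factors. It checks that both forms are positively homogeneous with degree $K$, that (15) never exceeds (16) (the arithmetic-geometric mean inequality), and that the two coincide once the factors are rescaled to a common norm with the product of the scale factors equal to one, a rescaling that leaves the outer product (12) unchanged.

```python
import math


def norm2(x):
    return math.sqrt(sum(t * t for t in x))


def g_product(factors):
    """Form (15), with every norm taken to be l2 (an assumption of this sketch)."""
    return math.prod(norm2(x) for x in factors)


def g_power_mean(factors):
    """Form (16), with every norm taken to be l2: (1/K) * sum_i ||x^i||^K."""
    K = len(factors)
    return sum(norm2(x) ** K for x in factors) / K


factors = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]  # norms 5, 1, 2; K = 3
alpha = 2.0
scaled = [[alpha * t for t in x] for x in factors]

# Rescale every factor to the common norm c, where c^3 = 5 * 1 * 2,
# so the product of the three scale factors equals 1.
c = 10.0 ** (1.0 / 3.0)
balanced = [[c * t / norm2(x) for t in x] for x in factors]

print(g_product(factors), g_power_mean(factors))
print(g_product(balanced), g_power_mean(balanced))
```

This is consistent with the statement that (15) and (16) induce the same $\Omega_{\phi,g}$: the infimum over factorizations is free to choose the balanced scaling.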

To define our regularization function on the output tensor, $X = \Phi(X^1, \ldots, X^K)$, it will be
necessary that the elemental regularization function, $g$, and the elemental mapping, $\phi$, satisfy a few
properties to be considered 'compatible' for the definition of our regularization function. Specifically,
we will require the following definition.


Definition 8 Given an elemental mapping $\phi$ and an elemental regularization function $g$, we will say
that $(\phi, g)$ are a nondegenerate pair if 1) $g$ and $\phi$ are both positively homogeneous with degree $p$, for
some $p \neq 0$ and 2) $\forall X \in \mathrm{Im}(\phi) \setminus 0$, $\exists \mu \in (0, \infty]$ and $(\tilde{z}^1, \ldots, \tilde{z}^K)$ such that $\phi(\tilde{z}^1, \ldots, \tilde{z}^K) = X$,
$g(\tilde{z}^1, \ldots, \tilde{z}^K) = \mu$, and $g(z^1, \ldots, z^K) \geq \mu$ for all $(z^1, \ldots, z^K)$ such that $\phi(z^1, \ldots, z^K) = X$.^3

From this, we now define our main regularization function: 

Definition 9 Given an elemental mapping $\phi$ and an elemental regularization function $g$ such that
$(\phi, g)$ are a nondegenerate pair, we define the factorization regularization function, $\Omega_{\phi,g}(X) :
\mathbb{R}^D \to \mathbb{R}_+ \cup \infty$ to be

$$\Omega_{\phi,g}(X) \equiv \inf_{r \in \mathbb{N}_+} \; \inf_{(X^1, \ldots, X^K)_r} \; \sum_{i=1}^r g(X_i^1, \ldots, X_i^K) \quad \text{subject to} \quad \Phi_r(X^1, \ldots, X^K) = X \tag{18}$$

with the additional condition that $\Omega_{\phi,g}(X) = \infty$ if $X \notin \bigcup_r \mathrm{Im}(\Phi_r)$.

3. Property 1 from the definition of a nondegenerate pair will be critical to our formulation. Several of our results can
be shown without Property 2, but Property 2 is almost always satisfied for most interesting choices of $(\phi, g)$ and is
designed to avoid 'pathological' functions (such as $\Omega_{\phi,g}(X) = 0$ $\forall X$). For example, in matrix factorization
with $\phi(u, v) = uv^T$, taking $g(u, v) = \|u\|^2 + \delta_C(v)$ for any arbitrary norm and conic set $C$ satisfies Property 1 but not
Property 2, as we can always reduce the value of $g(u, v)$ by scaling $u$ by a constant $\alpha \in (0, 1)$ and scaling $v$ by $\alpha^{-1}$
without changing the value of $\phi(u, v)$.

We will show that $\Omega_{\phi,g}$ is a convex function of $X$ and that in general the infimum in (18) can
always be achieved with a finitely sized factorization (i.e., $r$ does not need to approach $\infty$).^4 While
$\Omega_{\phi,g}$ suffers from many of the practical issues associated with the matrix norm $\|\cdot\|_{u,v}$ discussed
earlier (namely that in general it cannot be evaluated in polynomial time due to the complicated
definition), because $\Omega_{\phi,g}(X)$ is a convex function on $X$, this allows us to use $\Omega_{\phi,g}$ purely as an
analysis tool to derive results for a more tractable factorized formulation.


4.3 Problem Definition 

To build our analysis, we will start by defining the convex (but typically non-tractable) problem,
given by

$$\min_{X,Q} \; F(X, Q) \equiv \ell(X, Q) + \lambda \Omega_{\phi,g}(X) + H(Q). \tag{19}$$

Here $X \in \mathbb{R}^D$ is the output of the factorization mapping $X = \Phi_r(X^1, \ldots, X^K)$ as we have been
discussing, and the $Q$ term is an optional additional set of non-factorized variables which can be
helpful in modeling some problems (for example, to add intercept terms or to model outliers in the
data). For our analysis we will assume the following:

Assumption 1 $\ell(X, Q)$ is once differentiable and jointly convex in $(X, Q)$

Assumption 2 $H(Q)$ is convex (but possibly non-differentiable)

Assumption 3 $(\phi, g)$ are a nondegenerate pair as defined by Definition 8; $\Omega_{\phi,g}(X)$ is as defined
by (18); and $\lambda > 0$

Assumption 4 The minimum of $F(X, Q)$ exists: $\emptyset \neq \arg\min_{X,Q} F(X, Q)$.


As noted above, it is typically impractical to optimize over functions involving $\Omega_{\phi,g}(X)$, and,
even if one were given an optimal solution to (19), $X_{opt}$, one would still need to solve the problem
given in (18) to recover the desired $(X^1, \ldots, X^K)$ factors. Therefore, we use (19) merely as an analysis
tool and instead tailor our results to the non-convex optimization problem given by


$$\min_{(X^1,\ldots,X^K)_r,\,Q} f_r(X^1,\ldots,X^K,Q) = \ell(\Phi_r(X^1,\ldots,X^K),Q) + \lambda \sum_{i=1}^{r} g(X_i^1,\ldots,X_i^K) + H(Q). \tag{20}$$


We will show in the next section that any local minimum of (20) is a global minimum if it satisfies the condition that one slice from each of the factorized tensors is all zero. Further, we will also show that if $r$ is taken to be large enough then from any initialization we can always find a global minimum of (20) by doing an optimization based purely on local descent.
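To make the factorized objective (20) concrete, the following is a minimal sketch for one specific instance. It assumes matrix factorization with $\phi(u,v) = uv^T$ and $g(u,v) = \|u\|_2\|v\|_2$ (the pairing mentioned in footnote 4), a squared-error loss against a data matrix `Y`, and $H = 0$. The loss, the helper names, and the test matrices are illustrative assumptions, not part of the framework itself.

```python
import math

def phi(u, v):
    """Elemental mapping phi(u, v) = u v^T (positively homogeneous, degree 2)."""
    return [[a * b for b in v] for a in u]

def g(u, v):
    """Elemental regularizer g(u, v) = ||u||_2 * ||v||_2 (also degree 2)."""
    return math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))

def Phi_r(slices):
    """Phi_r: sum of phi over the r slices, each slice being a pair (u_i, v_i)."""
    rows, cols = len(slices[0][0]), len(slices[0][1])
    X = [[0.0] * cols for _ in range(rows)]
    for u, v in slices:
        for i, row in enumerate(phi(u, v)):
            for j, val in enumerate(row):
                X[i][j] += val
    return X

def f_r(slices, Y, lam):
    """Objective (20) with the assumed loss 0.5 * ||Y - X||_F^2 and H = 0."""
    X = Phi_r(slices)
    loss = 0.5 * sum((Y[i][j] - X[i][j]) ** 2
                     for i in range(len(Y)) for j in range(len(Y[0])))
    return loss + lam * sum(g(u, v) for u, v in slices)

Y = [[3.0, 0.0], [0.0, 1.0]]
tight = [([3.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0])]
# Two extra slices that cancel inside Phi_r but still pay regularization.
padded = tight + [([1.0, 0.0], [1.0, 0.0]), ([-1.0, 0.0], [1.0, 0.0])]

f_tight = f_r(tight, Y, lam=0.5)
f_padded = f_r(padded, Y, lam=0.5)
```

Both factorizations reproduce `Y` exactly, yet `padded` has the larger objective because its two cancelling slices contribute to the regularizer. For this choice of $(\phi, g)$ the regularization term of `tight` equals 4, which is the nuclear norm of `Y`, so `tight` attains the infimum in (18) while `padded` does not.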

4. In particular, the largest that $r$ needs to be is $\mathrm{card}(D)$, and we note that $\mathrm{card}(D)$ is a worst case upper bound on the size of the factorization. In certain cases the bound can be shown to be lower. As an example, $\Omega_{\phi,g}(X) = \|X\|_*$ when $\phi(u,v) = uv^T$ and $g(u,v) = \|u\|_2 \|v\|_2$. In this case the infimum can be achieved with $r \le \mathrm{rank}(X) \le \min\{\mathrm{card}(u), \mathrm{card}(v)\}$.

5. Main Analysis

We begin our analysis by first showing a few simple properties and lemmas relevant to our framework.


5.1 Preliminary Results 

First, from the definition of $\Phi_r$ it is easy to verify that if $\phi$ is positively homogeneous with degree $p$, then $\Phi_r$ is also positively homogeneous with degree $p$ and satisfies the following proposition.

Proposition 10 Given a size-$r_x$ set of $K$ factors, $(X^1, \ldots, X^K)_{r_x}$, and a size-$r_y$ set of $K$ factors, $(Y^1, \ldots, Y^K)_{r_y}$, then $\forall \alpha \ge 0, \beta \ge 0$

$$\Phi_{(r_x + r_y)}([\alpha X^1\ \beta Y^1], \ldots, [\alpha X^K\ \beta Y^K]) = \alpha^p\,\Phi_{r_x}(X^1, \ldots, X^K) + \beta^p\,\Phi_{r_y}(Y^1, \ldots, Y^K) \tag{21}$$

where recall, $[X\ Y]$ denotes the concatenation of $X$ and $Y$ along the final dimension of the tensor.
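Proposition 10 can be checked numerically on a small instance. The sketch below assumes the matrix factorization mapping $\phi(u,v) = uv^T$ (so $p = 2$) and represents a size-$r$ factorization as a Python list of `(u, v)` slices, so that concatenation along the final dimension becomes list concatenation. The helper names and the numeric values are illustrative choices only.

```python
def phi(u, v):
    """Elemental mapping phi(u, v) = u v^T, positively homogeneous with degree 2."""
    return [[a * b for b in v] for a in u]

def Phi(slices):
    """Sum of phi over all slices (u_i, v_i)."""
    rows, cols = len(slices[0][0]), len(slices[0][1])
    total = [[0.0] * cols for _ in range(rows)]
    for u, v in slices:
        for i, row in enumerate(phi(u, v)):
            for j, val in enumerate(row):
                total[i][j] += val
    return total

def scale(slices, c):
    """Multiply every factor in every slice by the scalar c."""
    return [([c * a for a in u], [c * b for b in v]) for u, v in slices]

p = 2
X = [([1.0, 2.0], [3.0, -1.0])]                              # size r_x = 1
Y = [([0.5, 0.0], [2.0, 4.0]), ([1.0, 1.0], [1.0, 0.0])]     # size r_y = 2
alpha, beta = 2.0, 0.5

# Left side of (21): Phi of the concatenated, scaled factors.
lhs = Phi(scale(X, alpha) + scale(Y, beta))
# Right side of (21): alpha^p * Phi(X) + beta^p * Phi(Y).
PX, PY = Phi(X), Phi(Y)
rhs = [[alpha ** p * PX[i][j] + beta ** p * PY[i][j] for j in range(2)]
       for i in range(2)]

max_gap = max(abs(lhs[i][j] - rhs[i][j]) for i in range(2) for j in range(2))
```

The gap between the two sides is zero up to floating-point error, which is all that (21) asserts for this particular mapping.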


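Further, $\Omega_{\phi,g}$ satisfies the following proposition:

Proposition 11 The function $\Omega_{\phi,g} : \mathbb{R}^D \to \mathbb{R}_+ \cup \infty$ defined in (18) has the properties

1. $\Omega_{\phi,g}(0) = 0$ and $\Omega_{\phi,g}(X) > 0$ $\forall X \ne 0$.

2. $\Omega_{\phi,g}$ is positively homogeneous with degree 1.

3. $\Omega_{\phi,g}(X + Y) \le \Omega_{\phi,g}(X) + \Omega_{\phi,g}(Y)$ $\forall (X, Y)$.

4. $\Omega_{\phi,g}(X)$ is convex w.r.t. $X \in \mathbb{R}^D$.

5. The infimum in (18) can be achieved with $r \le \mathrm{card}(D)$ $\forall X$ s.t. $\Omega_{\phi,g}(X) < \infty$.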

Proof (Proposition 11) Many of these properties can be shown in a similar fashion to results for the $\|\cdot\|_{u,v}$ norm discussed previously (Bach et al., 2008; Yu et al., 2014).

1) By definition and the fact that $g$ is positive semidefinite, we always have $\Omega_{\phi,g}(X) \ge 0$ $\forall X$. Trivially, $\Omega_{\phi,g}(0) = 0$ since we can always take $(X^1, \ldots, X^K) = (0, \ldots, 0)$ to achieve the infimum. For $X \ne 0$, because $(\phi, g)$ is a nondegenerate pair, $\sum_{i=1}^{r} g(X_i^1, \ldots, X_i^K) > 0$ for any $(X^1, \ldots, X^K)_r$ s.t. $\Phi_r(X^1, \ldots, X^K) = X$ and $r$ finite. Property 5) shows that the infimum can be achieved with $r$ finite, completing the result.

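2) For all $\alpha > 0$ and any $(X^1, \ldots, X^K)_r$ such that $X = \Phi_r(X^1, \ldots, X^K)$, note that from positive homogeneity $\Phi_r(\alpha^{1/p}X^1, \ldots, \alpha^{1/p}X^K) = \alpha X$ and $\sum_{i=1}^{r} g(\alpha^{1/p}X_i^1, \ldots, \alpha^{1/p}X_i^K) = \alpha \sum_{i=1}^{r} g(X_i^1, \ldots, X_i^K)$. Applying this fact to the definition of $\Omega_{\phi,g}$ gives that $\Omega_{\phi,g}(\alpha X) = \alpha\,\Omega_{\phi,g}(X)$.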

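3) If either $\Omega_{\phi,g}(X) = \infty$ or $\Omega_{\phi,g}(Y) = \infty$ then the inequality is trivially satisfied. Considering any $(X, Y)$ pair such that $\Omega_{\phi,g}$ is finite for both $X$ and $Y$, for any $\epsilon > 0$ let $(X^1, \ldots, X^K)_{r_x}$ be an $\epsilon$-optimal factorization of $X$. Specifically, $\Phi_{r_x}(X^1, \ldots, X^K) = X$ and $\sum_{i=1}^{r_x} g(X_i^1, \ldots, X_i^K) \le \Omega_{\phi,g}(X) + \epsilon$. Similarly, let $(Y^1, \ldots, Y^K)_{r_y}$ be an $\epsilon$-optimal factorization of $Y$. From Proposition 10 we have $\Phi_{r_x + r_y}([X^1\ Y^1], \ldots, [X^K\ Y^K]) = X + Y$, so $\Omega_{\phi,g}(X + Y) \le \sum_{i=1}^{r_x} g(X_i^1, \ldots, X_i^K) + \sum_{j=1}^{r_y} g(Y_j^1, \ldots, Y_j^K) \le \Omega_{\phi,g}(X) + \Omega_{\phi,g}(Y) + 2\epsilon$. Letting $\epsilon$ tend to 0 completes the result.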

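4) Convexity is given by the combination of properties 2 and 3. Further, note that properties 2 and 3 also show that $\{X \in \mathbb{R}^D : \Omega_{\phi,g}(X) < \infty\}$ is a convex set.

5) Let $\Gamma \subset \mathbb{R}^D$ be defined as

$$\Gamma = \{X : \exists (x^1, \ldots, x^K),\ \phi(x^1, \ldots, x^K) = X,\ g(x^1, \ldots, x^K) \le 1\} \tag{22}$$

Note that because $(\phi, g)$ is a nondegenerate pair, for any non-zero $X \in \Gamma$ there exists $\alpha \in [1, \infty)$ such that $\alpha X$ is on the boundary of $\Gamma$, so $\Gamma$ and its convex hull are compact sets.

Further, note that $\Gamma$ contains the origin by definition of $\phi$ and $g$, so as a result, $\Omega_{\phi,g}$ is equivalent to a gauge function on the convex hull of $\Gamma$

$$\Omega_{\phi,g}(X) = \inf_{\mu} \{\mu : \mu \ge 0,\ X \in \mu\,\mathrm{conv}(\Gamma)\} \tag{23}$$

Since the infimum w.r.t. $\mu$ is linear and constrained to a compact set, it must be achieved. Therefore, there must exist $\mu_{opt} \ge 0$, $\theta \in \mathbb{R}^{\mathrm{card}(D)}$ with $\theta_i \ge 0\ \forall i$ and $\sum_{i=1}^{\mathrm{card}(D)} \theta_i = 1$, and $\{(Z_i^1, \ldots, Z_i^K) : \phi(Z_i^1, \ldots, Z_i^K) \in \Gamma\}_{i=1}^{\mathrm{card}(D)}$ such that $X = \mu_{opt} \sum_{i=1}^{\mathrm{card}(D)} \theta_i\,\phi(Z_i^1, \ldots, Z_i^K)$ and $\Omega_{\phi,g}(X) = \mu_{opt}$.

This, combined with positive homogeneity, completes the result as we can take $(X_i^1, \ldots, X_i^K) = ((\mu_{opt}\theta_i)^{1/p} Z_i^1, \ldots, (\mu_{opt}\theta_i)^{1/p} Z_i^K)$, which gives

$$\mu_{opt} = \Omega_{\phi,g}(X) \le \sum_{i=1}^{\mathrm{card}(D)} g(X_i^1, \ldots, X_i^K) = \mu_{opt} \sum_{i=1}^{\mathrm{card}(D)} \theta_i\, g(Z_i^1, \ldots, Z_i^K) \le \mu_{opt} \sum_{i=1}^{\mathrm{card}(D)} \theta_i = \mu_{opt} \tag{24}$$

and shows that a factorization of size $\mathrm{card}(D)$ which achieves the infimum must exist. ■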


We next derive the Fenchel dual of $\Omega_{\phi,g}$, which will provide a useful characterization of the subgradient of $\Omega_{\phi,g}$.


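Proposition 12 The Fenchel dual of $\Omega_{\phi,g}(X)$ is given by

$$\Omega^*_{\phi,g}(W) = \begin{cases} 0 & \Omega^\circ_{\phi,g}(W) \le 1 \\ \infty & \text{otherwise} \end{cases} \tag{25}$$

where

$$\Omega^\circ_{\phi,g}(W) = \sup_{(z^1, \ldots, z^K)} \langle W, \phi(z^1, \ldots, z^K) \rangle \quad \text{subject to}\quad g(z^1, \ldots, z^K) \le 1. \tag{26}$$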


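Proof Recall, $\Omega^*_{\phi,g}(W) = \sup_Z \langle W, Z \rangle - \Omega_{\phi,g}(Z)$, so for $Z$ to approach the supremum we must have $Z \in \bigcup_r \mathrm{Im}(\Phi_r)$. As a result, the problem is equivalent to

$$\Omega^*_{\phi,g}(W) = \sup_{r \in \mathbb{N}_+}\ \sup_{(Z^1, \ldots, Z^K)_r} \langle W, \Phi_r(Z^1, \ldots, Z^K) \rangle - \sum_{i=1}^{r} g(Z_i^1, \ldots, Z_i^K) \tag{27}$$

$$= \sup_{r \in \mathbb{N}_+}\ \sup_{(Z^1, \ldots, Z^K)_r} \sum_{i=1}^{r} \left[ \langle W, \phi(Z_i^1, \ldots, Z_i^K) \rangle - g(Z_i^1, \ldots, Z_i^K) \right] \tag{28}$$

If $\Omega^\circ_{\phi,g}(W) \le 1$ then all the terms in the summation of (28) will be non-positive, so taking $(Z^1, \ldots, Z^K) = (0, \ldots, 0)$ will achieve the supremum. Conversely, if $\Omega^\circ_{\phi,g}(W) > 1$, then $\exists (z^1, \ldots, z^K)$ such that $\langle W, \phi(z^1, \ldots, z^K) \rangle > g(z^1, \ldots, z^K)$. This result, combined with the positive homogeneity of $\phi$ and $g$, gives that $\Omega^*_{\phi,g}$ is unbounded by considering $(\alpha z^1, \ldots, \alpha z^K)$ as $\alpha \to \infty$. ■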


We briefly note that the optimization problem associated with (26) is typically referred to as the polar problem and is a generalization of the concept of a dual norm. Solving the polar can still be very challenging and is often the limiting factor in applying our results in practice (see Bach, 2013; Zhang et al., 2013 for further information).
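For one well-known special case the polar problem is easy. With $\phi(u,v) = uv^T$ and $g(u,v) = \|u\|_2\|v\|_2$, the polar (26) is the supremum of $u^T W v$ over $\|u\|_2\|v\|_2 \le 1$, which is the largest singular value of $W$ (the spectral norm, dual to the nuclear norm). The sketch below estimates it by power iteration on $W^T W$. The function names and the fixed iteration count are illustrative assumptions, and the method can fail if the starting vector happens to be orthogonal to the leading right singular vector.

```python
import math

def spectral_norm(W, iters=200):
    """Estimate the largest singular value of W by power iteration on W^T W."""
    m, n = len(W), len(W[0])
    v = [1.0 / math.sqrt(n)] * n          # unit-norm starting vector
    sigma = 0.0
    for _ in range(iters):
        Wv = [sum(W[i][j] * v[j] for j in range(n)) for i in range(m)]
        sigma = math.sqrt(sum(x * x for x in Wv))   # ||W v|| for unit v
        if sigma == 0.0:
            return 0.0                    # W v = 0: nothing to iterate on
        WtWv = [sum(W[i][j] * Wv[i] for i in range(m)) for j in range(n)]
        norm = math.sqrt(sum(x * x for x in WtWv))
        v = [x / norm for x in WtWv]
    return sigma

def polar_is_feasible(W, tol=1e-9):
    """Check the condition 'polar(W) <= 1' from (25) for this matrix case."""
    return spectral_norm(W) <= 1.0 + tol

sigma_a = spectral_norm([[3.0, 0.0], [0.0, 1.0]])
sigma_b = spectral_norm([[0.0, 2.0], [0.0, 0.0]])
```

For more general $(\phi, g)$ pairs no such closed form is available, which is the difficulty the paragraph above refers to.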


With the above derivation of the Fenchel dual, we now recall that if $\Omega_{\phi,g}(X) < \infty$ then the subgradient of $\Omega_{\phi,g}(X)$ can be characterized by $\partial\Omega_{\phi,g}(X) = \{W : \langle X, W \rangle = \Omega_{\phi,g}(X) + \Omega^*_{\phi,g}(W)\}$. This forms the basis for the following lemma which will be used in our main results.


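Lemma 13 Given a factorization $X = \Phi_r(X^1, \ldots, X^K)$ and a regularization function $\Omega_{\phi,g}(X)$, then the following conditions are equivalent:

1. $(X^1, \ldots, X^K)$ is an optimal factorization of $X$; i.e., $\sum_{i=1}^{r} g(X_i^1, \ldots, X_i^K) = \Omega_{\phi,g}(X)$.

2. $\exists W$ such that $\Omega^\circ_{\phi,g}(W) \le 1$ and $\langle W, \Phi_r(X^1, \ldots, X^K) \rangle = \sum_{i=1}^{r} g(X_i^1, \ldots, X_i^K)$.

3. $\exists W$ such that $\Omega^\circ_{\phi,g}(W) \le 1$ and $\forall i \in \{1, \ldots, r\}$, $\langle W, \phi(X_i^1, \ldots, X_i^K) \rangle = g(X_i^1, \ldots, X_i^K)$.

Further, any $W$ which satisfies condition 2 or 3 satisfies both conditions 2 and 3 and $W \in \partial\Omega_{\phi,g}(X)$.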


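Proof $2 \iff 3$) 3 trivially implies 2 from the definition of $\Phi_r$. For the opposite direction, because $\Omega^\circ_{\phi,g}(W) \le 1$ we have $\langle W, \phi(X_i^1, \ldots, X_i^K) \rangle \le g(X_i^1, \ldots, X_i^K)$ $\forall i$. Taking the sum over $i$, we can only achieve equality in 2 if we have equality $\forall i$ in condition 3. This also shows that any $W$ which satisfies condition 2 or 3 must also satisfy the other condition.

We next show that if $W$ satisfies conditions 2/3 then $W \in \partial\Omega_{\phi,g}(X)$. First, from condition 2/3 and the definition of $\Omega_{\phi,g}$, we have $\Omega_{\phi,g}(X) \le \sum_{i=1}^{r} g(X_i^1, \ldots, X_i^K) = \langle W, X \rangle < \infty$. Thus, recall that because $\Omega_{\phi,g}(X)$ is convex and finite at $X$, we have $\langle W, X \rangle \le \Omega_{\phi,g}(X) + \Omega^*_{\phi,g}(W)$ with equality iff $W \in \partial\Omega_{\phi,g}(X)$. Now, by contradiction assume $W$ satisfies conditions 2/3 but $W \notin \partial\Omega_{\phi,g}(X)$. From condition 2/3 we have $\Omega^*_{\phi,g}(W) = 0$, so $\Omega_{\phi,g}(X) = \Omega_{\phi,g}(X) + \Omega^*_{\phi,g}(W) > \langle X, W \rangle = \sum_{i=1}^{r} g(X_i^1, \ldots, X_i^K)$, which contradicts the definition of $\Omega_{\phi,g}(X)$.

$1 \Rightarrow 2$) Any $W \in \partial\Omega_{\phi,g}(X)$ satisfies $\langle X, W \rangle = \Omega_{\phi,g}(X) + \Omega^*_{\phi,g}(W) = \sum_{i=1}^{r} g(X_i^1, \ldots, X_i^K)$.

$2 \Rightarrow 1$) By contradiction, assume $(X^1, \ldots, X^K)_r$ was not an optimal factorization of $X$. This gives $\Omega_{\phi,g}(X) < \sum_{i=1}^{r} g(X_i^1, \ldots, X_i^K) = \langle W, X \rangle = \Omega_{\phi,g}(X) + \Omega^*_{\phi,g}(W) = \Omega_{\phi,g}(X)$, producing the contradiction. ■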
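Condition 3 of Lemma 13 is a per-slice equality that is straightforward to test once a candidate $W$ is given. The sketch below again assumes the matrix case $\phi(u,v) = uv^T$, $g(u,v) = \|u\|_2\|v\|_2$, and takes $W$ to be the identity, whose spectral norm is 1 so that the polar constraint holds for this pairing. The helper names and example factorizations are illustrative; the function only tests a supplied $W$ and does not search for one.

```python
import math

def inner_with_outer(W, u, v):
    """<W, u v^T> computed as u^T W v."""
    return sum(u[i] * W[i][j] * v[j]
               for i in range(len(u)) for j in range(len(v)))

def g(u, v):
    """Elemental regularizer g(u, v) = ||u||_2 * ||v||_2."""
    return math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))

def satisfies_condition_3(W, slices, tol=1e-12):
    """True if <W, phi(slice_i)> equals g(slice_i) for every slice."""
    return all(abs(inner_with_outer(W, u, v) - g(u, v)) <= tol
               for u, v in slices)

W = [[1.0, 0.0], [0.0, 1.0]]   # identity; polar value 1 in the assumed matrix case
tight = [([3.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0])]
# Same product X = diag(3, 1), but with two slices that cancel each other.
padded = tight + [([1.0, 0.0], [1.0, 0.0]), ([-1.0, 0.0], [1.0, 0.0])]

ok_tight = satisfies_condition_3(W, tight)
ok_padded = satisfies_condition_3(W, padded)
```

Here `W` certifies that `tight` is an optimal factorization of $\mathrm{diag}(3, 1)$. The same `W` fails on `padded` because the slice with $u = (-1, 0)$ gives inner product $-1$ while its regularization cost is $+1$.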


Finally, we show one additional lemma before presenting our main results. 

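Lemma 14 If $(X^1, \ldots, X^K, Q)$ is a local minimum of $f_r(X^1, \ldots, X^K, Q)$ as given in (20), then for any $\theta \in \mathbb{R}^r$

$$\left\langle -\tfrac{1}{\lambda} \nabla_X \ell(\Phi_r(X^1, \ldots, X^K), Q),\ \sum_{i=1}^{r} \theta_i\, \phi(X_i^1, \ldots, X_i^K) \right\rangle = \sum_{i=1}^{r} \theta_i\, g(X_i^1, \ldots, X_i^K) \tag{29}$$

Proof Let $(Z_i^1, \ldots, Z_i^K) = (\theta_i X_i^1, \ldots, \theta_i X_i^K)$ for all $i \in \{1, \ldots, r\}$ and let $\Lambda = \sum_{i=1}^{r} \theta_i\, \phi(X_i^1, \ldots, X_i^K)$. From positive homogeneity and the fact that we have a local minimum, $\exists \delta > 0$ such that $\forall \epsilon \in (0, \delta)$ we must have

$$f_r(X^1, \ldots, X^K, Q) \le f_r(X^1 + \epsilon Z^1, \ldots, X^K + \epsilon Z^K, Q) \implies \tag{30}$$

$$\ell(\Phi_r(X^1, \ldots, X^K), Q) + \lambda \sum_{i=1}^{r} g(X_i^1, \ldots, X_i^K) + H(Q) \le \ell\!\left( \sum_{i=1}^{r} (1 + \epsilon\theta_i)^p\, \phi(X_i^1, \ldots, X_i^K),\ Q \right) + \lambda \sum_{i=1}^{r} (1 + \epsilon\theta_i)^p\, g(X_i^1, \ldots, X_i^K) + H(Q) \tag{31}$$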


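Taking the first order approximation $(1 + \epsilon\theta_i)^p = 1 + p\epsilon\theta_i + O(\epsilon^2)$ and rearranging the terms of (31), we arrive at

$$0 \le \ell(\Phi_r(X^1, \ldots, X^K) + p\epsilon\Lambda + O(\epsilon^2), Q) - \ell(\Phi_r(X^1, \ldots, X^K), Q) + p\epsilon\lambda \sum_{i=1}^{r} \theta_i\, g(X_i^1, \ldots, X_i^K) + O(\epsilon^2) \tag{32}$$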


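Dividing (32) by $\epsilon$ and taking $\lim_{\epsilon \searrow 0}$, we note that the difference in the $\ell(\cdot, \cdot)$ terms gives the one-sided directional derivative $d\ell(\Phi_r(X^1, \ldots, X^K), Q)(p\Lambda, 0)$; thus from the differentiability of $\ell$ we get

$$0 \le \langle \nabla_X \ell(\Phi_r(X^1, \ldots, X^K), Q),\ p\Lambda \rangle + p\lambda \sum_{i=1}^{r} \theta_i\, g(X_i^1, \ldots, X_i^K) \tag{33}$$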

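Noting that for $\epsilon > 0$ but sufficiently small, we also must have $f_r(X^1, \ldots, X^K, Q) \le f_r(X^1 - \epsilon Z^1, \ldots, X^K - \epsilon Z^K, Q)$, using identical steps as before and taking the first order approximation $(1 - \epsilon\theta_i)^p = 1 - p\epsilon\theta_i + O(\epsilon^2)$, we get

$$0 \le \ell(\Phi_r(X^1, \ldots, X^K) - p\epsilon\Lambda + O(\epsilon^2), Q) - \ell(\Phi_r(X^1, \ldots, X^K), Q) - p\epsilon\lambda \sum_{i=1}^{r} \theta_i\, g(X_i^1, \ldots, X_i^K) + O(\epsilon^2) \tag{34}$$

and dividing by $\epsilon$ and taking the limit $\lim_{\epsilon \searrow 0}$, we arrive at

$$0 \le \langle \nabla_X \ell(\Phi_r(X^1, \ldots, X^K), Q),\ -p\Lambda \rangle - p\lambda \sum_{i=1}^{r} \theta_i\, g(X_i^1, \ldots, X_i^K) \tag{35}$$

Combining (33) and (35) and rearranging terms gives the result. ■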


5.2 Main Results 

Based on the above preliminary results, we are now ready to state our main results and several immediate corollaries.

Theorem 15 Given a function $f_r(X^1, \ldots, X^K, Q)$ of the form given in (20), any local minimizer of the optimization problem

$$\min_{(X^1, \ldots, X^K)_r,\,Q} f_r(X^1, \ldots, X^K, Q) = \ell(\Phi_r(X^1, \ldots, X^K), Q) + \lambda \sum_{i=1}^{r} g(X_i^1, \ldots, X_i^K) + H(Q) \tag{36}$$

such that $(X_{i_0}^1, \ldots, X_{i_0}^K) = (0, \ldots, 0)$ for some $i_0 \in \{1, \ldots, r\}$ is a global minimizer.


Proof (Theorem 15)

We begin by noting that from the definition of $\Omega_{\phi,g}(X)$, for any factorization $X = \Phi_r(X^1, \ldots, X^K)$

$$F(X, Q) = \ell(X, Q) + \lambda\,\Omega_{\phi,g}(X) + H(Q) \le \ell(\Phi_r(X^1, \ldots, X^K), Q) + \lambda \sum_{i=1}^{r} g(X_i^1, \ldots, X_i^K) + H(Q) = f_r(X^1, \ldots, X^K, Q) \tag{37}$$

with equality at any factorization which achieves the infimum in (18). We will show that a local minimum of $f_r(X^1, \ldots, X^K, Q)$ satisfying the conditions of the theorem also satisfies the conditions for $(\Phi_r(X^1, \ldots, X^K), Q)$ to be a global minimum of the convex function $F(X, Q)$, which implies a global minimum of $f_r(X^1, \ldots, X^K, Q)$ due to the global bound in (37).

First, because (19) is a convex function, a simple subgradient condition gives that $(X, Q)$ is a global minimum of $F(X, Q)$ iff the following two conditions are satisfied

$$-\tfrac{1}{\lambda} \nabla_X \ell(X, Q) \in \partial\Omega_{\phi,g}(X) \tag{38}$$

$$-\nabla_Q \ell(X, Q) \in \partial H(Q) \tag{39}$$

where $\nabla_X \ell(X, Q)$ and $\nabla_Q \ell(X, Q)$ denote the portions of the gradient of $\ell(X, Q)$ corresponding to $X$ and $Q$, respectively. If $(X^1, \ldots, X^K, Q)$ is a local minimum of $f_r(X^1, \ldots, X^K, Q)$, then (39) must be satisfied at $(X, Q) = (\Phi_r(X^1, \ldots, X^K), Q)$, as this is implied by the first order optimality condition for a local minimum (Rockafellar and Wets, 1998, Chap. 10), so we are left to show that (38) is also satisfied.


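Turning to the factorization objective, if $(X^1, \ldots, X^K, Q)$ is a local minimum of $f_r(X^1, \ldots, X^K, Q)$, then $\forall (Z^1, \ldots, Z^K)_r$ there exists $\delta > 0$ such that $\forall \epsilon \in (0, \delta)$ we have $f_r(X^1 + \epsilon^{1/p} Z^1, \ldots, X^K + \epsilon^{1/p} Z^K, Q) \ge f_r(X^1, \ldots, X^K, Q)$. If we now consider search directions $(Z^1, \ldots, Z^K)_r$ of the form

$$(Z_j^1, \ldots, Z_j^K) = \begin{cases} (z^1, \ldots, z^K) & j = i_0 \\ (0, \ldots, 0) & \text{otherwise} \end{cases} \tag{40}$$

where $i_0$ is the index such that $(X_{i_0}^1, \ldots, X_{i_0}^K) = (0, \ldots, 0)$, then for $\epsilon \in (0, \delta)$, we have

$$\ell(\Phi_r(X^1, \ldots, X^K), Q) + \lambda \sum_{i=1}^{r} g(X_i^1, \ldots, X_i^K) + H(Q) \le \tag{41}$$

$$\ell(\Phi_r(X^1 + \epsilon^{1/p} Z^1, \ldots, X^K + \epsilon^{1/p} Z^K), Q) + \lambda \sum_{i=1}^{r} g(X_i^1 + \epsilon^{1/p} Z_i^1, \ldots, X_i^K + \epsilon^{1/p} Z_i^K) + H(Q) = \tag{42}$$

$$\ell(\Phi_r(X^1, \ldots, X^K) + \epsilon\,\phi(z^1, \ldots, z^K), Q) + \lambda \sum_{i=1}^{r} g(X_i^1, \ldots, X_i^K) + \epsilon\lambda\, g(z^1, \ldots, z^K) + H(Q). \tag{43}$$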


The equality between (42) and (43) comes from the special form of $Z$ given by (40) and the positive homogeneity of $\phi$ and $g$. Rearranging terms, we now have

$$\epsilon^{-1} \left[ \ell(\Phi_r(X^1, \ldots, X^K) + \epsilon\,\phi(z^1, \ldots, z^K), Q) - \ell(\Phi_r(X^1, \ldots, X^K), Q) \right] \ge -\lambda\, g(z^1, \ldots, z^K). \tag{44}$$


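Taking the limit $\lim_{\epsilon \searrow 0}$ of (44), we note that the left side of the inequality is simply the definition of the one-sided directional derivative of $\ell(\Phi_r(X^1, \ldots, X^K), Q)$ in the direction $(\phi(z^1, \ldots, z^K), 0)$, which combined with the differentiability of $\ell(X, Q)$, gives

$$\langle \phi(z^1, \ldots, z^K),\ \nabla_X \ell(\Phi_r(X^1, \ldots, X^K), Q) \rangle \ge -\lambda\, g(z^1, \ldots, z^K). \tag{45}$$

Because $(z^1, \ldots, z^K)$ was arbitrary, we have established that

$$\left\langle \phi(z^1, \ldots, z^K),\ -\tfrac{1}{\lambda} \nabla_X \ell(\Phi_r(X^1, \ldots, X^K), Q) \right\rangle \le g(z^1, \ldots, z^K)\ \ \forall (z^1, \ldots, z^K) \iff \Omega^\circ_{\phi,g}\!\left( -\tfrac{1}{\lambda} \nabla_X \ell(\Phi_r(X^1, \ldots, X^K), Q) \right) \le 1 \tag{46}$$

Further, if we choose $\theta$ to be a vector of all ones in Lemma 14 we get

$$\sum_{i=1}^{r} g(X_i^1, \ldots, X_i^K) = \left\langle \Phi_r(X^1, \ldots, X^K),\ -\tfrac{1}{\lambda} \nabla_X \ell(\Phi_r(X^1, \ldots, X^K), Q) \right\rangle \tag{47}$$

which, combined with (46) and Lemma 13, shows that $-\tfrac{1}{\lambda} \nabla_X \ell(\Phi_r(X^1, \ldots, X^K), Q) \in \partial\Omega_{\phi,g}(\Phi_r(X^1, \ldots, X^K))$, completing the result. ■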


From this result, we can then test the global optimality of any local minimum (regardless of 
whether it has an all-zero slice or not) from the immediate corollary: 

Corollary 16 Given a function $f_r(X^1, \ldots, X^K, Q)$ of the form given in (20), any local minimizer of the optimization problem

$$\min_{(X^1, \ldots, X^K)_r,\,Q} f_r(X^1, \ldots, X^K, Q) \tag{48}$$

is a global minimizer if $([X^1\ 0], \ldots, [X^K\ 0], Q)$ is a local minimizer of $f_{r+1}$.

From the results of Theorem 15 we are now also able to show that if we let the size of the factorized variables ($r$) become large enough, then from any initialization we can always find a global minimizer of $f_r(X^1, \ldots, X^K, Q)$ using a purely local descent strategy. Specifically, we have the following result.

Theorem 17 Given a function $f_r(X^1, \ldots, X^K, Q)$ as defined by (20), if $r > \mathrm{card}(D)$ then from any point $(Z^1, \ldots, Z^K, Q)$ such that $f_r(Z^1, \ldots, Z^K, Q) < \infty$ there must exist a non-increasing path from $(Z^1, \ldots, Z^K, Q)$ to a global minimizer of $f_r(X^1, \ldots, X^K, Q)$.


Proof (Theorem 17)

Clearly if $(Z^1, \ldots, Z^K, Q)$ is not a local minimum, then we can follow a decreasing path until we reach a local minimum. Having arrived at a local minimum, $(X^1, \ldots, X^K, Q)$, if $(X_i^1, \ldots, X_i^K) = (0, \ldots, 0)$ for any $i \in \{1, \ldots, r\}$ then from Theorem 15 we must be at a global minimum. We are left to show that a non-increasing path to a global minimizer must exist from any local minimum such that $(X_i^1, \ldots, X_i^K) \ne (0, \ldots, 0)$ for all $i \in \{1, \ldots, r\}$.

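Let us define the set $S = \{\phi(X_i^1, \ldots, X_i^K)\}_{i=1}^{r}$. Because $r > \mathrm{card}(D)$, the $r$ elements of $S$ must be linearly dependent, so there must exist $\theta \in \mathbb{R}^r$ such that $\theta \ne 0$ and $\sum_{i=1}^{r} \theta_i\, \phi(X_i^1, \ldots, X_i^K) = 0$. Further, from Lemma 14 we must have that $\sum_{i=1}^{r} \theta_i\, g(X_i^1, \ldots, X_i^K) = \left\langle -\tfrac{1}{\lambda} \nabla_X \ell(\Phi_r(X^1, \ldots, X^K), Q),\ \sum_{i=1}^{r} \theta_i\, \phi(X_i^1, \ldots, X_i^K) \right\rangle = 0$. Because $g(X_i^1, \ldots, X_i^K) > 0$ $\forall i \in \{1, \ldots, r\}$, this implies that at least one entry of $\theta$ must be strictly less than zero. Without loss of generality, scale $\theta$ so that $\min_i \theta_i = -1$. Now, for all $(\gamma, i) \in [0, 1] \times \{1, \ldots, r\}$, let us define

$$(R_i^1(\gamma), \ldots, R_i^K(\gamma)) = ((1 + \gamma\theta_i)^{1/p} X_i^1, \ldots, (1 + \gamma\theta_i)^{1/p} X_i^K) \tag{49}$$

where $p$ is the degree of positive homogeneity of $(\phi, g)$. Note that by construction $(R^1(0), \ldots, R^K(0)) = (X^1, \ldots, X^K)$ and that for $\gamma = 1$ there must exist $i_0 \in \{1, \ldots, r\}$ such that $(R_{i_0}^1(1), \ldots, R_{i_0}^K(1)) = (0, \ldots, 0)$.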

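Further, from the positive homogeneity of $(\phi, g)$ we have $\forall \gamma \in [0, 1]$

$$f_r(R^1(\gamma), \ldots, R^K(\gamma), Q) = \ell\!\left( \sum_{i=1}^{r} \phi(X_i^1, \ldots, X_i^K) + \gamma \sum_{i=1}^{r} \theta_i\, \phi(X_i^1, \ldots, X_i^K),\ Q \right) + \lambda\gamma \sum_{i=1}^{r} \theta_i\, g(X_i^1, \ldots, X_i^K) + \lambda \sum_{i=1}^{r} g(X_i^1, \ldots, X_i^K) + H(Q) \tag{50}$$

$$= \ell(\Phi_r(X^1, \ldots, X^K), Q) + \lambda \sum_{i=1}^{r} g(X_i^1, \ldots, X_i^K) + H(Q) \tag{51}$$

$$= f_r(X^1, \ldots, X^K, Q) \tag{52}$$

where the equality between (50) and (51) is seen by recalling that $\sum_{i=1}^{r} \theta_i\, \phi(X_i^1, \ldots, X_i^K) = 0$ and $\sum_{i=1}^{r} \theta_i\, g(X_i^1, \ldots, X_i^K) = 0$.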

As a result, as $\gamma$ goes from $0 \to 1$ we can traverse a path from $(X^1, \ldots, X^K, Q) \to (R^1(1), \ldots, R^K(1), Q)$ without changing the value of $f_r$. Also recall that by construction $(R_{i_0}^1(1), \ldots, R_{i_0}^K(1)) = (0, \ldots, 0)$, so if $(R^1(1), \ldots, R^K(1), Q)$ is a local minimizer of $f_r$ then it must be a global minimizer due to Theorem 15. If $(R^1(1), \ldots, R^K(1), Q)$ is not a local minimizer then there must exist a descent direction and we can iteratively apply this result until we reach a global minimizer, completing the proof. ■

Algorithm 1 (Local Descent Meta-Algorithm)
input $p$ - Degree of positive homogeneity for $(\phi, g)$
input $\{(X^1, \ldots, X^K)_r, Q\}$ - Initialization for variables
while Not Converged do
    Perform local descent on variables $\{(X^1, \ldots, X^K), Q\}$ until arriving at a local minimum $\{(\tilde{X}^1, \ldots, \tilde{X}^K), \tilde{Q}\}$
    if $\exists i_0 \in \{1, \ldots, r\}$ such that $(\tilde{X}_{i_0}^1, \ldots, \tilde{X}_{i_0}^K) = (0, \ldots, 0)$ then
        $\{(\tilde{X}^1, \ldots, \tilde{X}^K), \tilde{Q}\}$ is a global minimum. Return.
    else
        if $\exists \theta \in \mathbb{R}^r \setminus 0$ such that $\sum_{i=1}^{r} \theta_i\, \phi(\tilde{X}_i^1, \ldots, \tilde{X}_i^K) = 0$ then
            Scale $\theta$ so that $\min_i \theta_i = -1$
            Set $(X_i^1, \ldots, X_i^K) = ((1 + \theta_i)^{1/p} \tilde{X}_i^1, \ldots, (1 + \theta_i)^{1/p} \tilde{X}_i^K)$, $\forall i \in \{1, \ldots, r\}$
        else
            Increase size of factorized variables by appending an all zero slice
            $(X^1, \ldots, X^K)_{r+1} = ([\tilde{X}^1\ 0], \ldots, [\tilde{X}^K\ 0])$
        end if
        Set $Q = \tilde{Q}$
        Continue loop
    end if
end while


We note that from this proof we have also described a meta-algorithm (outlined in Algorithm 1) which can be used with any local-descent optimization strategy to guarantee convergence to a global minimum. While in general the size of the factorization ($r$) might increase as the algorithm proceeds, as a worst case, it is guaranteed that a global minimum can be found with a finite $r$ never growing larger than $\mathrm{card}(D) + 1$. Also note that this is a worst case upper bound on $r$ for the most general form of our framework and that for specific choices of $\phi$ and $g$ the bound on the maximum $r$ required can be significantly lowered.

Corollary 18 Algorithm 1 will find a global minimum of $f_r(X^1, \ldots, X^K, Q)$ as defined in (20). If $r$ is initialized to be greater than $\mathrm{card}(D)$, then the size of the factorized variables will not increase. Otherwise, the algorithm will terminate with $r \le \mathrm{card}(D) + 1$.
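The rescaling step used in the proof of Theorem 17 and in Algorithm 1 can be illustrated on a toy example. The sketch below assumes the matrix case $\phi(u,v) = uv^T$ with $p = 2$ and hand-picks two linearly dependent slices together with a $\theta$ that satisfies $\sum_i \theta_i \phi(X_i) = 0$ and $\min_i \theta_i = -1$. It only checks that $\Phi_r$ is unchanged along the path (49) and that one slice is zeroed at $\gamma = 1$; that the regularizer also stays constant relies on Lemma 14 holding at a local minimum, which this sketch does not verify. All names and values are illustrative.

```python
def phi(u, v):
    """Elemental mapping phi(u, v) = u v^T (degree p = 2)."""
    return [[a * b for b in v] for a in u]

def Phi_r(slices):
    """Sum of phi over all slices (u_i, v_i)."""
    rows, cols = len(slices[0][0]), len(slices[0][1])
    X = [[0.0] * cols for _ in range(rows)]
    for u, v in slices:
        for i, row in enumerate(phi(u, v)):
            for j, val in enumerate(row):
                X[i][j] += val
    return X

def rescale(slices, theta, gamma, p=2):
    """Scale slice i by (1 + gamma * theta_i)^(1/p), following (49)."""
    out = []
    for (u, v), t in zip(slices, theta):
        c = (1.0 + gamma * t) ** (1.0 / p)
        out.append(([c * a for a in u], [c * b for b in v]))
    return out

# phi of slice 2 is exactly twice phi of slice 1, so 2*phi_1 - 1*phi_2 = 0.
slices = [([1.0, 0.0], [1.0, 1.0]), ([2.0, 0.0], [1.0, 1.0])]
theta = [2.0, -1.0]            # nonzero, with minimum entry equal to -1

target = Phi_r(slices)
gaps = []
for gamma in (0.0, 0.25, 0.5, 0.75, 1.0):
    moved = Phi_r(rescale(slices, theta, gamma))
    gaps.append(max(abs(moved[i][j] - target[i][j])
                    for i in range(2) for j in range(2)))

end_point = rescale(slices, theta, 1.0)
```

At $\gamma = 1$ the second slice of `end_point` is all zero, which is the configuration Theorem 15 requires, while the product $\Phi_r$ is the same at every sampled $\gamma$.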

6. Discussion and Conclusions 

We begin the discussion of our results with a cautionary note; namely, these results can be challenging to apply in practice. In particular, many algorithms based on alternating minimization can typically only guarantee convergence to a critical point, and with the inherent non-convexity of the problem, verifying whether a given critical point is also a local minimum can be a challenging problem on its own. Nevertheless, we emphasize that our results guarantee that global minimizers can be found from purely local descent if the optimization problem falls within the general framework we have described here. As a result, even if the particular local descent strategy one chooses for a specific problem does not come with guaranteed convergence to a local minimum, the scope of the problem is still vastly reduced from a full global optimization. There is no need, in theory, to consider multiple initializations or more complicated (and much larger scale) techniques to explore the entire search space.

6.1 Balanced Degrees of Homogeneity 

In addition to the above points, our analysis also provides a few insights into the behavior of factorization problems and offers simple guidance on the design of such problems. The first is that balancing the degree of positive homogeneity between the regularization function and the mapping function is crucial. Here we have analyzed a mapping $\Phi$ with the particular form given in (10). We conjecture our results can likely be generalized to include additional factorization mappings (which we save for future work), but even for more general mappings and regularization functions, requiring the degrees of positive homogeneity to match between the regularization function and the mapping function will be critical to showing results similar to those we present here. In general, if the degrees of positive homogeneity do not match between the factorization mapping and the regularization function, then it either becomes impossible to make guarantees regarding the global optimality of a local minimum, or the regularization function does nothing to limit the size of the factorization, so the degrees of freedom in the model are largely determined by the user defined choice of $r$. As a demonstration of these phenomena, first consider the case where we have a general mapping $\Phi(X^1, \ldots, X^K)$ which is positively homogeneous with degree $p$ (but which is not assumed to have form (10)). Now, consider a general regularization function $G(X^1, \ldots, X^K)$ which is positively homogeneous with degree $p' < p$; then the following proposition provides a simple counter-example demonstrating that in general it is not possible to guarantee that a global minimum can be found from local descent.

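Proposition 19 Let $\ell : \mathbb{R}^D \to \mathbb{R}$ be a convex function with $\partial\ell(0) \ne \emptyset$; let $\Phi : \mathbb{R}^{D^1} \times \ldots \times \mathbb{R}^{D^K} \to \mathbb{R}^D$ be a positively homogeneous mapping with degree $p$; and let $G : \mathbb{R}^{D^1} \times \ldots \times \mathbb{R}^{D^K} \to \mathbb{R}_+$ be a positively homogeneous function with degree $p' < p$ such that $G(0, \ldots, 0) = 0$ and $G(X^1, \ldots, X^K) > 0$ $\forall \{(X^1, \ldots, X^K) : \Phi(X^1, \ldots, X^K) \ne 0\}$. Then, the optimization problem given by

$$\min_{(X^1, \ldots, X^K)} f(X^1, \ldots, X^K) = \ell(\Phi(X^1, \ldots, X^K)) + G(X^1, \ldots, X^K) \tag{53}$$

will always have a local minimum at $(X^1, \ldots, X^K) = (0, \ldots, 0)$. Additionally, $\forall (X^1, \ldots, X^K)$ such that $\Phi(X^1, \ldots, X^K) \ne 0$ there exists a neighborhood such that $f(\epsilon X^1, \ldots, \epsilon X^K) > f(0, \ldots, 0)$ for $\epsilon > 0$ and small.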

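Proof Consider $f(\epsilon X^1, \ldots, \epsilon X^K) - f(0, \ldots, 0)$. This gives

$$\ell(\Phi(\epsilon X^1, \ldots, \epsilon X^K)) + G(\epsilon X^1, \ldots, \epsilon X^K) - \ell(0) - G(0, \ldots, 0) = \tag{54}$$

$$\ell(\epsilon^p\, \Phi(X^1, \ldots, X^K)) - \ell(0) + \epsilon^{p'} G(X^1, \ldots, X^K) \ge \tag{55}$$

$$\epsilon^p \langle \partial\ell(0),\ \Phi(X^1, \ldots, X^K) \rangle + \epsilon^{p'} G(X^1, \ldots, X^K) \tag{56}$$

Recall that $p > p'$ and $\Phi(X^1, \ldots, X^K) \ne 0 \implies G(X^1, \ldots, X^K) > 0$, so $\forall (X^1, \ldots, X^K)$, $f(\epsilon X^1, \ldots, \epsilon X^K) - f(0, \ldots, 0) \ge 0$ for $\epsilon > 0$ and sufficiently small, with equality iff $G(X^1, \ldots, X^K) = 0 \implies \Phi(X^1, \ldots, X^K) = 0$, giving the result. ■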


The above proposition shows that unless we have the special case where $(X^1, \ldots, X^K) = (0, \ldots, 0)$ happens to be a global minimizer, then there will always exist a local minimum at the origin, and from the origin it will always be necessary to take an increasing path to escape the local minimum. The case described above, where $p > p'$, is arguably the more common situation for mismatched degrees of homogeneity (as opposed to $p < p'$), and a typical example might be an objective function such as

$$\ell(\Phi(X^1, \ldots, X^K)) + \lambda \sum_{i=1}^{K} \|X^i\|^{p'} \tag{57}$$

where $\Phi$ is a positively homogeneous mapping with degree $K > 2$ (e.g., the mapping of a deep neural network) but $p'$ is typically taken to be only 1 or 2 depending on the particular choice of norm.
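A scalar instance of Proposition 19 makes the spurious local minimum at the origin easy to see. The sketch below assumes $\Phi(x_1, x_2) = x_1 x_2$ (degree $p = 2$), the convex loss $\ell(z) = \tfrac{1}{2}(z - 5)^2$, and $G(x_1, x_2) = \lambda(|x_1| + |x_2|)$ (degree $p' = 1$). These particular functions, the value $\lambda = 1$, and the sampled directions are illustrative choices, not taken from the analysis above.

```python
import math

def f(x1, x2, lam=1.0):
    """l(Phi(x)) + G(x) with Phi = x1*x2 (p = 2) and G = lam*(|x1| + |x2|) (p' = 1)."""
    return 0.5 * (x1 * x2 - 5.0) ** 2 + lam * (abs(x1) + abs(x2))

f_origin = f(0.0, 0.0)

# Sample small steps away from the origin in several directions.
directions = [(1, 1), (1, -1), (-1, 1), (-1, -1), (1, 0), (0, 1)]
nearby = [f(eps * d1, eps * d2)
          for eps in (1e-1, 1e-2, 1e-3)
          for d1, d2 in directions]

# A point far from the origin where the loss term vanishes.
root5 = math.sqrt(5.0)
f_far = f(root5, root5)
```

Every sampled point near the origin has a larger objective than the origin itself, because the degree-1 regularizer dominates the degree-2 change in the loss for small steps. The point $(\sqrt{5}, \sqrt{5})$ nevertheless has a much smaller objective, so the origin is a local minimum that is not global and local descent started there cannot escape.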

Conversely, in the situation where $p' > p$, then it is often the case that the regularization function is not sufficient to 'limit' the size of the factorization, in the sense that the objective function can always be decreased by allowing the size of the factors to grow. As a simple example, consider the case of matrix factorization with the objective function

$$\ell(UV^T) + \lambda(\|U\|^{p'} + \|V\|^{p'}) \tag{58}$$

If the size of the factorization doubles, then we can always take $[\tfrac{\sqrt{2}}{2}U\ \tfrac{\sqrt{2}}{2}U][\tfrac{\sqrt{2}}{2}V\ \tfrac{\sqrt{2}}{2}V]^T = UV^T$, so if $(\|\tfrac{\sqrt{2}}{2}[U\ U]\|^{p'} + \|\tfrac{\sqrt{2}}{2}[V\ V]\|^{p'}) < \|U\|^{p'} + \|V\|^{p'}$, then the objective function can always be decreased by simply duplicating and scaling the existing factorization. It is easily verified that the above inequality is satisfied for many choices of norm (for example, all the $l_q$ norms with $q \ge 1$) when $p' > 2$. As a result, this implies that the degrees of freedom in the model will be largely dependent on the initial choice of the number of columns in $(U, V)$, since in general the objective function is typically decreased by having all entries of $(U, V)$ be non-zero.


6.2 Implications for Neural Networks 


Examining our results specifically as they apply to deep neural networks, we first note that from our analysis we have shown that neural networks which are based on positively homogeneous mappings can be regularized in the way we have outlined in our framework so that the optimization problem of training the network can be analyzed from a convex framework. Further, we suggest that our results provide a partial explanation for the recently observed empirical phenomenon where replacing the traditional sigmoid or hyperbolic tangent non-linearities with positively homogeneous non-linearities, such as rectification and max-pooling, significantly boosts the speed of optimization and the performance of the network (Dahl et al., 2013; Maas et al., 2013; Krizhevsky et al., 2012; Zeiler et al., 2013). Namely, by using a positively homogeneous network mapping, the problem then becomes a convex function of the network outputs. Additionally, we have also shown that if the size of the network is allowed to be large enough then for any initialization a global minimizer can be found from purely local descent, and thus local minima are all equivalent. This is a similar conclusion to the work of Choromanska et al. (2014), who analyzed the problem from a statistical standpoint and showed that with sufficiently large networks and appropriate assumptions about the distributions of the data and network weights, then with high probability any family of networks learned with different initializations will have similar objective values, but we note that our results allow for a well defined set of conditions which will be sufficient to guarantee the property. Finally, many modern large scale networks do not use traditional regularization on the network weight parameters such as an $l_1$ or $l_2$ norm during training and instead rely on alternative forms of regularization such as dropout as it tends to achieve better performance in practice (Srivastava et al., 2014; Krizhevsky et al., 2012; Wan et al., 2013). Given our commentary above regarding the critical importance of balancing the degree of homogeneity between the mapping and the regularizer, an immediate prediction of our analysis is that simply ensuring that the degrees of homogeneity are balanced could be a significant factor in improving the performance of deep networks.

We conclude by noting that the main limitation of our current framework in the context of the analysis of currently existing state-of-the-art neural networks is that the form of the mapping we study here (10) implies that the network architecture must consist of $r$ parallel subnetworks, where each subnetwork has a particular architecture defined by the elemental mapping $\phi$. Previously, we mentioned as an example that the well known ImageNet network from (Krizhevsky et al., 2012) can be described by our framework by taking $r = 1$ and using an appropriate definition of $\phi$; however, to apply Corollary 16 to then test for global optimality, we must test whether it is possible to reduce the objective function by adding an entire network with the same architecture in parallel to the given network. Clearly, this is a significant limitation for the application of these results and suggests two possibilities for future work. The first is that simply implementing neural networks with a highly parallel network architecture and relatively simple subnetwork architectures could be advantageous and worthy of experimental study. In fact, the ImageNet network already has a certain degree of parallelization as the initial convolutional layers of the network operate largely in parallel on separate GPU units. More generally, here we have focused on mappings with form (10) as it is conducive to analysis, but we believe that many of the results we have presented here can be generalized to more general mappings (and thus more general network architectures) using many of the principles and analysis techniques we have presented here; an extension we reserve for future work.


6.3 Conclusions 

Here we have presented a general framework which allows for a wide variety of non-convex factorization problems to be analyzed with tools from convex analysis. In particular, we have shown that for problems which can be placed in our framework, any local minimum can be guaranteed to be a global minimum of the non-convex factorization problem if one slice of the factorized tensors is all zero. Additionally, we have shown that if the non-convex factorization problem is done with factors of sufficient size, then from any feasible initialization it is always possible to find a global minimizer using a purely local descent algorithm.